Analyzing the Indian Premier League

a tutorial by Hiteesh Nukalapati

Introduction

What is the Indian Premier League (IPL)?
It is the world's most popular, most watched, most sought-after cricket league.

What is Cricket?
Cricket is a team sport played between two sides of 11 players each, made up of batters and bowlers. It can be played in a stadium of any shape and size as long as the center has a 22-yard rectangular area called the pitch. There are two innings (phases) in a match, and the team that wins the toss gets to pick what it wants to do first (i.e., bat first or bowl first). The bowlers bowl to the batters: a batter scores runs by hitting the ball, whereas the bowler aims to take wickets (i.e., get the batter out). There are multiple formats of the game, but the IPL follows the T20 format, which stands for 20-20 and simply means that each team gets to bowl 20 overs (each over consists of 6 balls, so every innings has a maximum of 120 balls). A team can win the match in two main ways:

This is all the information you will need to understand and follow the tutorial, but I encourage you to learn more about the game at https://en.wikipedia.org/wiki/Cricket. It is a beautiful game!

So why did I choose the IPL to create a tutorial?
It simply reminds me of India. Cricket is a big sport in India and I care about it deeply. This season's IPL was postponed indefinitely due to the raging COVID-19 pandemic that has taken the country by storm. I planned on going to India this summer, and it looks like that is not happening. So I thought: what if I try to better understand the game? Maybe along the way I will find some interesting facts about it, and I will also miss the game a little less (Win-Win!!)

Set up

Locating and Loading the data

After a bit of googling, I found a data set that contains ball-by-ball data for IPL matches between 2008 and 2016 on Kaggle: https://www.kaggle.com/manasgarg/ipl
This dataset consists of two CSVs (CSV stands for Comma Separated Values, a common format for sharing data):


I will load these two files into two separate pandas DataFrames using the built-in pandas function read_csv(). This function takes the path to a file and returns a DataFrame (a 2D data structure used to store tabular data, much like a SQL table).
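As a rough sketch of the loading step, the snippet below reads two tiny in-memory CSVs so that it runs without the Kaggle download. The rows, the team names and most column names here are made up for illustration; with the real files you would pass a file path to read_csv() instead of a StringIO object.

```python
import io

import pandas as pd

# Tiny stand-ins for the two CSV files (hypothetical rows and columns).
# With the real data: pd.read_csv("path/to/file.csv")
matches_csv = io.StringIO(
    "id,season,team1,team2,winner,umpire3\n"
    "1,2008,Team A,Team B,Team A,\n"
    "2,2008,Team C,Team D,Team D,\n"
)
deliveries_csv = io.StringIO(
    "match_id,inning,over,ball,total_runs,player_dismissed\n"
    "1,1,1,1,0,\n"
    "1,1,1,2,4,\n"
    "2,1,1,1,1,Player X\n"
)

matches = pd.read_csv(matches_csv)
deliveries = pd.read_csv(deliveries_csv)

print(matches.shape)      # (rows, columns)
print(deliveries.head())  # first few rows
```

Empty fields in a CSV come back as NaN, which is why an entirely empty column such as umpire3 shows up as a column of NaNs.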

Tidying the data

When we look at the data, we realize that in the matches table (DataFrame), the umpire3 column is empty, filled with NaNs. So we can go ahead and drop that column from our table.
When we look at the deliveries table, we see that a few entries in some columns are NaNs.
Now, we must not drop these columns (because they indicate something important when they actually contain a value), but we must make them useful and usable. To do this, I will replace all NaN values in the deliveries table with 0s.
This works well here because we are not actually missing data: NaNs are used to indicate that a specific event did not occur. For this data, this is the best way forward, but it might not always be the best choice. It always depends on the type of data and the analysis we want to perform.
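A minimal sketch of these two tidying steps, run on a small made-up stand-in for the two tables. Only the umpire3 column name comes from the data described above; the other columns and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of the two tables.
matches = pd.DataFrame({
    "id": [1, 2],
    "team1": ["Team A", "Team C"],
    "umpire3": [np.nan, np.nan],
})
deliveries = pd.DataFrame({
    "match_id": [1, 1, 2],
    "total_runs": [0, 4, 1],
    # object dtype so the column can hold both names and the 0 marker
    "player_dismissed": pd.Series([None, "Player X", None], dtype="object"),
})

# Drop the column only if every value in it is missing.
if matches["umpire3"].isna().all():
    matches = matches.drop(columns=["umpire3"])

# Count missing values per column before deciding how to fill them.
print(deliveries.isna().sum())

# NaN here means "the event did not happen", so 0 works as a marker.
deliveries = deliveries.fillna(0)
```

Checking isna().all() before dropping guards against throwing away a column that turns out to hold a few real values.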
The team Rising Pune Supergiants changed their name by dropping the trailing s. I account for this by renaming both spellings to RPS.
For further simplicity, I am going to abbreviate team names so that they are easier to type, display and use. We got all the unique teams in the data by running matches['team1'].unique().
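One way to do the renaming is with a dictionary passed to replace(). The sketch below uses a partial mapping and a three-row toy table; the team2 and winner column names are assumptions, and a full mapping would need one entry per team returned by unique().

```python
import pandas as pd

# Toy table with both spellings of the Pune team (illustrative rows).
matches = pd.DataFrame({
    "team1": ["Mumbai Indians", "Rising Pune Supergiants", "Rising Pune Supergiant"],
    "team2": ["Chennai Super Kings", "Mumbai Indians", "Chennai Super Kings"],
    "winner": ["Mumbai Indians", "Rising Pune Supergiants", "Chennai Super Kings"],
})

# Partial mapping for illustration; both Pune spellings map to RPS.
team_abbreviations = {
    "Mumbai Indians": "MI",
    "Chennai Super Kings": "CSK",
    "Rising Pune Supergiants": "RPS",
    "Rising Pune Supergiant": "RPS",
}

# Apply the same mapping to every column that holds a team name.
team_columns = ["team1", "team2", "winner"]
matches[team_columns] = matches[team_columns].replace(team_abbreviations)

print(matches["team1"].unique())
```

Names that are not in the dictionary are left untouched, so running unique() again afterwards is a quick way to spot any team that was missed.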

Basic Analysis

We've used a few pandas library functions to perform some basic analysis.
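The kind of calls involved might look like the sketch below, shown on a made-up four-row table. The venue and player_of_match column names are assumptions about the matches file, and the values are placeholders.

```python
import pandas as pd

# Placeholder data; real venues and players come from the matches table.
matches = pd.DataFrame({
    "venue": ["Ground 1", "Ground 2", "Ground 1", "Ground 3"],
    "player_of_match": ["Player X", "Player Y", "Player X", "Player Z"],
})

# Number of distinct grounds that hosted a match.
n_venues = matches["venue"].nunique()

# Player with the most Man of the Match awards.
award_counts = matches["player_of_match"].value_counts()
top_player = award_counts.idxmax()

print(n_venues, top_player)
```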
As we can see, over the years, the tournament has been played at 30 locations (grounds, to be more precise). Most of these locations are in India, but if we look closely, some of them are in the UK, UAE and South Africa!!
The MVP, with the most Man of the Match awards, is none other than Chris Gayle! (If you follow cricket closely, you'll know what an impact the player has on the team and the game.) Let's look at every team's success so far:

As we can see, MI (Mumbai Indians) is the most successful IPL team. In fact, they've won the tournament 5 times now!
As this shows, visualization is an important aspect of data analysis: it lets us put things in perspective. Here, we used the matplotlib and seaborn libraries to plot the above graph, creating the bar plot with seaborn's barplot function. A visualization function always needs data, and sending the right data is very important. So, above, we created a new DataFrame called team_wins_df that holds exactly the data we need to plot the above graph.
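A sketch of how team_wins_df could be built and handed to seaborn. The winner column name and the six toy rows are assumptions, and plot_team_wins is a hypothetical helper name; the imports for matplotlib and seaborn sit inside it so the data preparation runs even where those libraries are not installed.

```python
import pandas as pd

# Toy results; each row is one match and names its winner.
matches = pd.DataFrame({"winner": ["MI", "CSK", "MI", "RPS", "MI", "CSK"]})

# One row per team with its number of wins, most successful first.
team_wins_df = (
    matches["winner"]
    .value_counts()
    .rename_axis("team")
    .reset_index(name="wins")
)
print(team_wins_df)


def plot_team_wins(df):
    """Draw a bar chart of wins per team (needs matplotlib and seaborn)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.barplot(data=df, x="team", y="wins")
    ax.set_title("Total Victories of IPL Teams")
    plt.show()
```

Calling plot_team_wins(team_wins_df) draws the chart. Keeping the data preparation separate from the plotting makes it easy to inspect the numbers before they are drawn.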

Further Analysis and Visualization

In the game of cricket, the toss is a very important factor. It gives the team winning the toss an edge, because they get their preferred decision (to either bowl first or bat first). This decision is made by the captain and coach after taking into account considerations like ground size, the dew factor, the opposition, the time of day, etc.
Let's take a closer look at the toss, toss decisions and the way they affect the game.

We use matches.shape[0] to get the total number of matches and matches['toss_decision'].value_counts() to count the occurrences of each decision.
As we can see, 57% of the teams winning the toss decide to field (bowl) first and the rest choose to bat first.
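The two calls above combine into a percentage as sketched below. The five toy rows and the "field" and "bat" labels are assumptions for illustration, so the output is not the 57% figure from the real data.

```python
import pandas as pd

# Toy toss decisions; one row per match.
matches = pd.DataFrame({
    "toss_decision": ["field", "bat", "field", "field", "bat"],
})

total_matches = matches.shape[0]
decision_counts = matches["toss_decision"].value_counts()

# Share of each decision as a percentage of all matches.
decision_pct = (decision_counts / total_matches * 100).round(1)
print(decision_pct)
```

The same result is available in one step with value_counts(normalize=True), which returns fractions instead of counts.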

Let's look at the toss decisions across seasons:

2016 was the year with the highest divide between teams choosing to field first and those choosing to bat first, while 2012 was the season with the lowest divide.
Unsurprisingly, across years, there is a lot of variation in how teams decide at the toss. This could be because of the location the teams were playing in: some conditions call for fielding first while others call for batting first.
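One way to measure the "divide" per season is a cross-tabulation of season against decision, as in this sketch. The season column name, the toy rows and the definition of divide as the absolute difference in counts are all assumptions.

```python
import pandas as pd

# Toy data: three matches in one season, four in the next.
matches = pd.DataFrame({
    "season": [2008, 2008, 2008, 2009, 2009, 2009, 2009],
    "toss_decision": ["bat", "field", "field", "bat", "bat", "field", "bat"],
})

# Rows are seasons, columns are decisions, cells are match counts.
decisions_by_season = pd.crosstab(matches["season"], matches["toss_decision"])

# Absolute gap between the two choices; larger means a more lopsided season.
decisions_by_season["divide"] = (
    decisions_by_season["field"] - decisions_by_season["bat"]
).abs()

print(decisions_by_season)
print(decisions_by_season["divide"].idxmax())  # season with the largest gap
```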
Let's see where the 2012 and 2017 seasons were played.

It turns out both the 2012 season and the 2017 season were played in India! So it might not be the location after all. It could just be a shift in strategy by coaches and teams.

Let's look at the teams that won the most tosses:

As we can see, MI has won the most tosses! This bar plot is also very similar to the bar plot above labelled "Total Victories of IPL Teams". How is it similar? Most of the teams that are successful are also good at winning tosses (more precisely, lucky)! So a toss can be an important factor in cricket. Please note that this chart in no way indicates that MI and the other teams with a lot of toss wins have a higher chance of winning any given toss.
That is, the teams on the lower end (GL, RPS, KTK) do not have a bad chance of winning the toss, they just did not play enough games! This is a very important observation: this data is not standardized.
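To illustrate what standardizing would look like, the sketch below divides each team's toss wins by the number of matches it played. The toy table and the toss_winner and team2 column names are assumptions; in it, two teams have the same raw number of toss wins but very different rates.

```python
import pandas as pd

# Toy fixtures: GL plays only two matches, MI and CSK play four each.
matches = pd.DataFrame({
    "team1": ["MI", "MI", "CSK", "MI", "GL"],
    "team2": ["CSK", "GL", "MI", "CSK", "CSK"],
    "toss_winner": ["MI", "GL", "MI", "CSK", "GL"],
})

# Matches played = appearances as team1 plus appearances as team2.
matches_played = pd.concat([matches["team1"], matches["team2"]]).value_counts()
toss_wins = matches["toss_winner"].value_counts()

# Divide raw toss wins by matches played to get a comparable rate.
toss_win_rate = (toss_wins / matches_played).fillna(0).sort_values(ascending=False)

print(toss_wins)
print(toss_win_rate)
```

Plotting toss_win_rate instead of toss_wins would keep a team with few matches from looking unlucky simply because it played less.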

Let's see if the toss winner is also the match winner

Important fact: The probability of winning the toss is equal for both the teams as a 2 sided coin is flipped at the toss.
From the above pie chart, we see that winning the toss does not necessarily mean winning the match!
At this point, it is also nice to appreciate that we have all these built-in library functions that make our lives easy. All we have to do is filter the data into a format that suits the visualization we choose.
Above, we used the matplotlib pie method to create the pie chart.
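The filtering behind such a pie chart can be sketched as a simple comparison of two columns. The toy rows and the toss_winner and winner column names are assumptions, and plot_outcome_pie is a hypothetical helper that imports matplotlib only when it is called.

```python
import pandas as pd

# Toy results: the toss winner also wins the match in two of four games.
matches = pd.DataFrame({
    "toss_winner": ["MI", "CSK", "MI", "RPS"],
    "winner": ["MI", "MI", "CSK", "RPS"],
})

# True where the team that won the toss also won the match.
toss_winner_won = matches["toss_winner"] == matches["winner"]
outcome_counts = toss_winner_won.value_counts()
share_won = toss_winner_won.mean()

print(outcome_counts)
print(share_won)


def plot_outcome_pie(counts):
    """Pie chart of toss winner won vs lost (needs matplotlib)."""
    import matplotlib.pyplot as plt

    labels = ["Toss winner won" if won else "Toss winner lost" for won in counts.index]
    plt.pie(counts, labels=labels, autopct="%1.1f%%")
    plt.show()
```

Matches without a recorded winner would compare as False here, so they may need to be filtered out first depending on how they should be counted.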

What about finals?

Let's see how the toss and other factors impact the tournament decider!

WOW! 83% of teams that win the toss in the decider win the match!
This could just be because the team winning the toss is under less pressure, but there could definitely be other factors that are not visible in this data. The toss is definitely an important factor when it comes to the final.

Let's see how the decision after winning the toss affects the outcome of the game:

Above, True = won the game and False = lost the game.
It looks like teams that win the toss and choose to bat first have won the final most often. Captains and coaches should definitely take this into consideration while making a decision in the final!
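A rough sketch of the finals analysis follows. It assumes the final is the last-listed match of each season, which may not be how the finals were actually selected, and it uses a toy table with assumed column names.

```python
import pandas as pd

# Toy table: two matches per season, the second one treated as the final.
matches = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "season": [2008, 2008, 2009, 2009, 2010, 2010],
    "toss_winner": ["MI", "CSK", "CSK", "MI", "RPS", "CSK"],
    "toss_decision": ["bat", "bat", "field", "field", "bat", "bat"],
    "winner": ["CSK", "CSK", "CSK", "CSK", "RPS", "MI"],
})

# Assumption: the final is the last match (highest id) of each season.
finals = matches.sort_values("id").groupby("season").tail(1).copy()
finals["toss_winner_won"] = finals["toss_winner"] == finals["winner"]

# Share of finals won by the team that won the toss.
share_won = finals["toss_winner_won"].mean()

# Toss decision against outcome (True = toss winner won the final).
decision_vs_result = pd.crosstab(finals["toss_decision"], finals["toss_winner_won"])

print(share_won)
print(decision_vs_result)
```

With only one final per season, the number of rows is small, so percentages from this table move a lot with every single result.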

Now, let's look at a few interesting and important stats.

I will explain what the significance of each statistic is and how it affects the team.

Runs across seasons.

Batters are responsible for the runs each team scores. The more runs the team batting first scores, the harder it is for the team batting second to win the match, and the more leeway the bowlers have to get the chasing batters out. So, in simple terms, the more runs a team scores, the better its chance of winning the match. This is an indicator of how competitive teams are.
To do this, we create batters_df to store only the batter data we need. The data we need exists in both the matches table and the deliveries table. We use the match_id column to our advantage here (as it is unique for every match) to merge the two tables with a left join into seasons_df. The generated graph shows that, in general, teams (all together) have increased the number of runs scored across seasons. Taken in isolation, this indicates that the tournament was at its most competitive in the 2013 season. There is a sharp dip from there on in terms of the runs being scored.
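The merge described above could look like the following sketch. The names batters_df and seasons_df follow the text; the toy data, the id, season and total_runs column names, and the choice of deliveries as the left table are assumptions.

```python
import pandas as pd

# Toy tables: three matches over two seasons, two deliveries each.
matches = pd.DataFrame({"id": [1, 2, 3], "season": [2008, 2008, 2009]})
deliveries = pd.DataFrame({
    "match_id": [1, 1, 2, 2, 3, 3],
    "total_runs": [4, 1, 6, 0, 2, 3],
})

# Keep only the columns needed for this statistic.
batters_df = deliveries[["match_id", "total_runs"]]

# Left join: every delivery keeps its row and gains the match's season.
seasons_df = batters_df.merge(
    matches[["id", "season"]], how="left", left_on="match_id", right_on="id"
)

runs_per_season = seasons_df.groupby("season")["total_runs"].sum()
print(runs_per_season)
```

A left join from the deliveries side keeps the row count unchanged, which is an easy sanity check after merging.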

Runs per match across seasons.

This is the same statistic as above (runs every season), except standardized by the number of matches played. This gives a more granular look at how the tournament has progressed across seasons.
We do this in the same way as above by creating a new DataFrame with only the data we need.
The generated graph shows the runs scored by both teams combined in each match, across seasons.
As we can see, the runs being scored per match are increasing with every season. What does this mean? Total runs are the number of matches multiplied by the runs per match, so a falling total alongside a rising per-match figure points to fewer matches in later seasons rather than weaker batting. Scoring has gone up across the board, rather than in a few games with a huge number of runs. This suggests that the tournament has been getting better and more competitive over the seasons.
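The per-match version can be sketched by first totalling runs per match and then averaging within each season. The toy numbers are made up so that the second season has a lower total but a higher per-match figure; column names are assumptions.

```python
import pandas as pd

# Toy tables: two matches in 2008, one match in 2009.
matches = pd.DataFrame({"id": [1, 2, 3], "season": [2008, 2008, 2009]})
deliveries = pd.DataFrame({
    "match_id": [1, 1, 2, 2, 3, 3],
    "total_runs": [4, 1, 6, 0, 2, 6],
})

# Runs scored by both teams combined in each match.
match_runs = deliveries.groupby("match_id", as_index=False)["total_runs"].sum()

# Attach the season of each match.
match_runs = match_runs.merge(matches, how="left", left_on="match_id", right_on="id")

# Season total, number of matches, and runs per match side by side.
season_summary = match_runs.groupby("season")["total_runs"].agg(
    total_runs="sum", matches="count", runs_per_match="mean"
)
print(season_summary)
```

Seeing the three columns together makes it clear when a drop in total runs comes from fewer matches rather than from lower scoring.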

Venue with the most matches!

As we can see, venues in India have hosted the most matches!

Score distribution by team by innings

I use box plots to show the lowest, highest and median batting totals for every team in both the first and second innings.
In the generated graphs:
If we look closely, there is a point near zero in the 2nd innings graph. This is an outlier: the match could have been disrupted by rain.
Coming to the observations, it looks like CSK has the best batting average, which means the team consistently scores a lot of runs irrespective of batting first or second.
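Box plots like these need one total per team per innings per match. The sketch below builds that table from ball-by-ball rows; the inning and batting_team column names and the toy values are assumptions, and plot_score_boxplots is a hypothetical helper that needs matplotlib and seaborn.

```python
import pandas as pd

# Toy ball-by-ball data: two matches, two innings each, two balls per innings.
deliveries = pd.DataFrame({
    "match_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "inning": [1, 1, 2, 2, 1, 1, 2, 2],
    "batting_team": ["CSK", "CSK", "MI", "MI", "MI", "MI", "CSK", "CSK"],
    "total_runs": [4, 6, 1, 2, 0, 3, 6, 6],
})

# One row per (match, innings, team) with the innings total.
innings_totals = deliveries.groupby(
    ["match_id", "inning", "batting_team"], as_index=False
)["total_runs"].sum()

# The numbers a box plot summarizes, per innings and team.
summary = innings_totals.groupby(["inning", "batting_team"])["total_runs"].agg(
    ["min", "median", "max"]
)
print(summary)


def plot_score_boxplots(df):
    """One box plot panel per innings (needs matplotlib and seaborn)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.catplot(data=df, x="batting_team", y="total_runs", col="inning", kind="box")
    plt.show()
```

Calling plot_score_boxplots(innings_totals) draws the panels. A match cut short by rain shows up in innings_totals as an unusually low total, which is what appears as an outlier point.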

Total Matches vs Total Wins for each team

Just to try something new, I used the pyplot library to create an interactive graph that represents matches vs wins for each team. Only the code is provided: this won't render as static HTML, but running it will create an interactive graph.
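Whichever plotting library is used, the graph needs a per-team table of matches played and matches won. A minimal sketch of that preparation step on toy data, with assumed column names, is shown here; the plotting call itself is left out.

```python
import pandas as pd

# Toy fixtures and results.
matches = pd.DataFrame({
    "team1": ["MI", "MI", "CSK", "RPS"],
    "team2": ["CSK", "RPS", "RPS", "MI"],
    "winner": ["MI", "RPS", "CSK", "MI"],
})

# Matches played counts appearances on either side of the fixture.
played = pd.concat([matches["team1"], matches["team2"]]).value_counts()
wins = matches["winner"].value_counts()

# Align both counts on team name; a team with no wins gets 0.
team_summary = pd.DataFrame({"matches": played, "wins": wins}).fillna(0).astype(int)
team_summary["win_pct"] = (team_summary["wins"] / team_summary["matches"] * 100).round(1)

print(team_summary)
```

The matches and wins columns can then be passed as the two series of a grouped bar chart or a scatter plot.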

Conclusion

We've seen a lot of different statistics, a lot of ways of visualizing these statistics, and important observations when it comes to handling data. For some of you reading this, it could be your first time reading about, understanding and analyzing cricket. Here are a few important takeaways and observations (in terms of Data Science and Cricket):

  1. Mumbai Indians (MI) seems to be the most successful team of the IPL.
  2. Chennai Super Kings (CSK) comes a close second in terms of success but has not won as many trophies as MI.
  3. A team that wins the toss doesn't have any special shot at winning the game; it is almost equally likely for either team to win.
  4. In the finals, the team that wins the toss has been winning more games than the team that loses the toss by a considerable margin (about 80-20 percent)
  5. The tournament as a whole has been getting more and more competitive over the years, continuing to grow. (A quick Google search indicates a massive increase in viewership numbers across the world.)
The one important thing about Data Science that we need to talk about is that correlation does not necessarily mean causation (it might seem trivial, but it is good to be explicit at times). It is also important to know and understand your data. This lets us handle it well, tidy it and use it efficiently. As we've seen, graphs can sometimes convey the wrong message. The graph showing the number of times each team won the toss plainly indicates that some teams won a lot and some teams did not, but it does not take into account how many games each team played. Some teams might have played a lot of games while others played far fewer, and that graph does not say so! So, if possible, we should always standardize data! In our example, we could have divided the toss wins by the number of games each of those teams played.

If you like what you see and are interested in learning more, good news! Data Science is an amazing open field! You can find resources and amazing documentation for the concepts and tools I've used here! A few are listed below: